Skip to content

fix: preserve features in IterableDataset.add_column (fixes #5752)#8115

Open
IgnazioDS wants to merge 1 commit intohuggingface:mainfrom
IgnazioDS:fix/streaming-add-column-loses-features
Open

fix: preserve features in IterableDataset.add_column (fixes #5752)#8115
IgnazioDS wants to merge 1 commit intohuggingface:mainfrom
IgnazioDS:fix/streaming-add-column-loses-features

Conversation

@IgnazioDS
Copy link
Copy Markdown

Summary

Fixes #5752.

IterableDataset.add_column() lost the dataset's feature schema on every call because it delegated to map() without passing the features argument. map() unconditionally writes info.features = features, so the default None silently overwrote the existing schema.

Root cause (iterable_dataset.py:3459):

info = self.info.copy()
info.features = features  # None when add_column() doesn't pass features

Fix: infer the new column's Arrow type via the existing _infer_features_from_batch helper, merge it with the existing features, and pass the result to map().

new_features = None
if self._info.features is not None:
    column_features = _infer_features_from_batch({name: list(column)})
    new_features = Features({**self._info.features, **column_features})
return self.map(..., features=new_features)

If the dataset had no features to begin with (_info.features is None), behaviour is unchanged — map() receives None as before.

Test plan

  • Reproduce the original issue: create a streaming dataset with known features, call .add_column(), assert .features is not None and contains the new column
  • Verify existing test_iterable_dataset.py tests still pass
  • Smoke-test with numpy array column input (not just list)

…ce#5752)

map() unconditionally sets info.features to its `features` parameter,
which defaults to None. add_column() called map() without passing
features, so every add_column() call wiped out the existing schema.

Fix: infer the new column's Arrow type and merge it with the existing
features before delegating to map(), so the returned dataset retains
the full schema including the newly added column.
Copy link
Copy Markdown
Member

@lhoestq lhoestq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm ! I just added a suggestion, lmk what you think before we merge

# Without this, map() would set info.features=None (its default), losing all schema info.
new_features = None
if self._info.features is not None:
column_features = _infer_features_from_batch({name: list(column)})
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this could help avoid unnecessary data copies, since _infer_features_from_batch does copy all the data into arrow

Suggested change
column_features = _infer_features_from_batch({name: list(column)})
column_features = _infer_features_from_batch({name: list(column[:config.DEFAULT_MAX_BATCH_SIZE])})

@HuggingFaceDocBuilderDev
Copy link
Copy Markdown

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Streaming dataset looses .feature method after .add_column

3 participants